A Time Series Overview with Quarto

Dave Hurst - Director of Data Science Platforms

2024-10-10

Overview

This presentation offers a friendly introduction to Time Series, designed for everyone at ProCogia. We’ll begin with a brief introduction to Dave, who is leading our Data Science platform strategy. From there, we’ll explore what sets time series data apart from other types of data, dive into analysis and forecasting using an R framework, and showcase engaging examples of time series from current events. Finally, we’ll conclude with a discussion on Quarto, the technical communication tool used to create this presentation.

Please note: This presentation includes data from the Gun Violence Archive, and some discussion around shooting fatalities which some may find upsetting.

Agenda

Time Series

  • What are time series?
  • What problems can it solve?
  • tidyverts a ts framework for R
  • Time series in the context of AI

graph TD
    AI["Artificial Intelligence"] --> ML["Machine Learning"]  
    ML --> DS["Data Science"] 
    DS --> TS["Time Series"]

Quarto

  • An open-source scientific and technical publishing system

Time Series Overview

Definition

  • A time series is a sequence of data points collected at regular time intervals, used to identify patterns, trends, or make forecasts over time.

Use Cases

  • Segmented product sales over time.
  • County pothole repair records.
  • Monitoring stock prices and financial markets.
  • Analyzing sensor data in IoT systems.

Example Time Series Forecast

TS Forecast vs Analysis: Digression

Data Science

R for Data Science (2e) - Whole Game

R for Data Science (2e) - Whole Game

Data Analytics

  • Data Analytics = Data Science - Modeling
  • Business Intelligence = Data Analytics + DataOps

TS Forecast vs Analysis

Introducing the tidyverts

tidyverts packages

https://otexts.com/fpp3/

https://otexts.com/fpp3/

TS Forecast vs Analysis

  • \(Time Series Forecast \Rightarrow Modeling\)
Code
library(fpp3) # tidyverts

# Example: Monthly Air Passenger Data
air_passengers <- as_tsibble(AirPassengers)

# Simple Forecasting Model
fit <- air_passengers |>
  model(ETS(value ~ error() + trend() + season()))

# Forecast the next 12 periods
fc <- fit %>% forecast(h = 12)

# Plot the time series with forecast
autoplot(air_passengers, value) +
  autolayer(fc) +
  labs(title = "Air Passenger Forecast",
       y = "Passengers",
       x = "Year")

  • \(Time Series Analysis \Rightarrow Decomposition/Segmentation\)

TS Decomposition

air_passengers |> 
    model(stl = STL(value)) |> 
    components() |> 
    autoplot()

A More ‘Real World’ Dataset

Courtesy of Gun Violence Archive

This is an initial look at the Malcolm Gladwell assertion from Revisionist History podcast Guns Part 4: Moral Hazard. The assertion is that gun homicide statistics, which are decreasing, are masking a rise in total gun violence incidents because non-fatal shooting are not counted in the data.

Note: the data from the Gun Violence Archive includes both fatality and injury-only incidents and is compiled from media reports. Incidents that were reported to the police, but not found in public reports is not included.

Incident ID Date State City severity fatality
2726996 2023-10-10 California Los Angeles Victim Injury FALSE
2726951 2023-10-10 California Los Angeles Victim Injury FALSE
2727033 2023-10-09 California Rancho Cordova Victim Fatality TRUE
2726961 2023-10-09 California Bakersfield Victim Injury FALSE
2726332 2023-10-09 California Oakland Victim Injury FALSE
2726327 2023-10-09 California Oakland Victim Injury FALSE
2726295 2023-10-09 California West Hollywood Victim Injury FALSE
2726301 2023-10-09 California Los Angeles Victim Injury FALSE
2726921 2023-10-09 California San Francisco Victim Fatality TRUE
2726173 2023-10-08 California Los Angeles Victim Injury FALSE
2726359 2023-10-08 California Stockton Victim Fatality TRUE
2727111 2023-10-08 California Santa Ana Victim Fatality TRUE
2726310 2023-10-08 California West Covina Victim Fatality TRUE
2726362 2023-10-08 California Sacramento Victim Fatality TRUE
2726291 2023-10-08 California Fresno Victim Injury FALSE
2725386 2023-10-07 California Long Beach Victim Injury FALSE
2725869 2023-10-07 California Oakland Victim Injury FALSE
2725880 2023-10-07 California Oakland Victim Injury FALSE
2725110 2023-10-07 California Canoga Park Suspect Fatality TRUE
2725878 2023-10-07 California Oakland Victim Injury FALSE
2725876 2023-10-07 California Oakland Victim Injury FALSE
2726376 2023-10-07 California San Francisco Victim Fatality TRUE
2725669 2023-10-07 California Inglewood Victim Fatality TRUE
2725113 2023-10-07 California Los Angeles Victim Injury FALSE
2725117 2023-10-06 California Gardena Victim Injury FALSE
2724595 2023-10-06 California Los Angeles Victim Fatality TRUE
2724655 2023-10-06 California Santa Barbara Victim Injury FALSE
2725119 2023-10-06 California Richmond Victim Fatality TRUE
2725115 2023-10-06 California Los Angeles Victim Injury FALSE
2724590 2023-10-05 California Downey Victim Injury FALSE
2724584 2023-10-05 California Long Beach Victim Injury FALSE
2724524 2023-10-05 California Lodi Victim Injury FALSE
2723769 2023-10-05 California Fresno Suspect Injury FALSE
2723728 2023-10-05 California Rancho Murieta Victim Injury FALSE
2723457 2023-10-04 California Los Angeles Victim Injury FALSE
2723347 2023-10-04 California Los Angeles Victim Fatality TRUE
2722903 2023-10-04 California Stockton Victim Injury FALSE
2722741 2023-10-04 California Los Angeles Victim Injury FALSE
2723427 2023-10-04 California Long Beach Victim Fatality TRUE
2721653 2023-10-03 California Los Angeles Victim Injury FALSE
2722524 2023-10-03 California Anaheim Suspect Fatality TRUE
2722173 2023-10-03 California Lodi Victim Fatality TRUE
2722899 2023-10-03 California Stockton Victim Injury FALSE
2721835 2023-10-03 California Sacramento Victim Injury FALSE
2722511 2023-10-03 California Los Angeles Victim Injury FALSE
2722731 2023-10-03 California Stockton Victim Injury FALSE
2721666 2023-10-03 California Lancaster Suspect Injury FALSE
2721380 2023-10-02 California Los Angeles Victim Injury FALSE
2723912 2023-10-02 California North Hills Victim Injury FALSE
2721677 2023-10-02 California Atwater Victim Fatality TRUE

gva_ts_yr <- gva3state |> 
    filter(Date < ymd('2023-10-01'),
           Date >= ymd('2020-10-01')) |> 
    count(Date, State, fatality, name = 'Incidents') |> 
    as_tsibble(index = Date,
               key = c(State, fatality)) |> 
    fill_gaps(Incidents = 0L)

gva_ts_yr_plt <- gva_ts_yr |> 
    model(stl = STL(Incidents)) |> 
    components() |> 
    autoplot() 
gva_ts_yr_plt

library(plotly)
ggplotly(gva_ts_yr_plt)

Forecasting Multiple Cities

gva_3city <- gva3state |> 
    filter(Date < ymd('2023-10-01'),
           Date >= ymd('2020-10-01'),
           City %in% c('Chicago', 'Los Angeles', 'Indianapolis')) |> 
    count(Date, City, name = 'Incidents') |> 
    as_tsibble(index = Date,
               key = c(City)) |> 
    fill_gaps(Incidents = 0L)

gva_3city_train <- gva_3city |> filter(Date < ymd('2023-09-01'))
gva_3city_test <- gva_3city |> filter(Date >= ymd('2023-09-01'))

gva_3city_fit <- gva_3city_train |> 
    model(ETS = ETS(Incidents))

# Forecast the next 12 periods
gva_3city_fc <- gva_3city_fit %>% forecast(h = 30)

# Plot the time series with forecast
gva_3city_train |> 
    filter(Date >= ymd('2023-07-01')) |> 
    autoplot(Incidents) +
    autolayer(gva_3city_fc, Incidents, alpha = 0.5) +
    autolayer(gva_3city_test, Incidents, alpha = 0.3)+
    labs(title = "Metropolis GVA September Incidents Forecast") +
    facet_wrap(. ~ City, ncol = 1, scales = "free_y")

Forecasting Multiple Cities

Chicago

Chicago Forecast - ETS

Exponential Smoothing

chi_fit <- gva_chi_train |> 
  model(ETS(Incidents))

# Forecast the next 12 periods
chi_fc <- chi_fit %>% forecast(h = 30)

# Plot the time series with forecast
gva_chi_train |> 
    filter(year(Date) == 2023) |> 
    autoplot(Incidents) +
    autolayer(chi_fc, Incidents, alpha = 0.5) +
    autolayer(gva_chi_test, Incidents, alpha = 0.3)+
    labs(title = "Chicago GVA September Incidents Forecast")

Chicago Forecast - ETS

ARIMA Models

AR - Auto Regression

  • Regression model using preceding time lag(s)

I - Integrated

  • Compute Differences to remove trend, seasonality, variance (stationarity)

MA - Moving Average (Model)

  • Maximum Likelihood Estimatino (MLE) model using the residual (error) term(s)

ARIMA (and SARIMA) Models

ARIMA(y ~ pdq() + PDQ() )

AR(p) where:

\[ X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + \epsilon_t \]

MA(q)

\[ X_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} \]

d - is the number of differences
P,D,Q - are the seasonal terms

ARIMA Process

graph TD
    classDef boxStyle padding:30px;

    Step1["Plot the data. Understand the Patterns"] --> Step2["Stablize Variables (Box-Cox)"]
    Step2 --> Choose["Select Method"]
    Choose --> |Auto| StepA1["Use ARIMA()"]
    Choose --> |Manual| StepB1["Difference to make series stationary"]
    StepB1 --> StepB2["Plot ACF/PACF"]
    StepB2 --> StepB3["Use ARIMA(var ~ pdq() + PDQ())"]
    StepB3 --> Residuals["Check Residuals and ACF"]
    StepA1 --> Residuals
    Residuals --> |No| StepB2
    Residuals --> |Yes| Forecast["Calculate Forcast"]
    
    style Choose fill:#9f9,stroke:#333,stroke-width:2px,rx:50,ry:50;
    style Residuals fill:#9f9,stroke:#333,stroke-width:2px,rx:50,ry:50;

ARIMA - Stationarity Example

ARIMA - Stationarity Example

air_passengers |> 
    mutate(log_v = log(value)) |> 
    autoplot(log_v) 

Constant variance, but trend and seasonality are present.

ARIMA - Stationarity Example

air_passengers |> 
    mutate(log_v = log(value),
           D1 = difference(log_v, lag = 12)) |>
    autoplot(D1) 

Seasonality is removed, but trend exists.

ARIMA - Stationarity Example

air_passengers |> 
    mutate(log_v = log(value),
           D1 = difference(log_v, lag = 12),
           D1d1 = difference(D1, lag = 1)) |>
    autoplot(D1d1) 

Looks like whith noise. Let’s do a statistical test.

\(H_0\) is that series is stationary:

air_passengers |> 
    mutate(log_v = log(value),
           D1 = difference(log_v, lag = 12),
           D1d1 = difference(D1, lag = 1)) |>
    features(D1d1, unitroot_kpss) 
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0844         0.1

Chicago Forecast - ARIMA

gva_chi_train |> 
    gg_tsdisplay(y = Incidents |> 
                     difference(7), 
                 plot_type = 'partial')

Chicago Forecast - ARIMA

chi_fit <- gva_chi_train |> 
  model(auto_arima = ARIMA(Incidents),
        #auto_arima2 = ARIMA(Incidents, stepwise = FALSE, approx = FALSE),
        auto_dfix = ARIMA(Incidents ~ 0 + pdq(1,1,1) + PDQ(0,1,2)),
        arima_ar = ARIMA(Incidents ~ 0 + pdq(2,0,0) + PDQ(4,1,0)),
        arima_ma = ARIMA(Incidents ~ 0 + pdq(0,0,2) + PDQ(0,1,1)),
        ets = ETS(Incidents),
        snaive = SNAIVE(Incidents))

# Forecast the next 12 periods
chi_fc <- chi_fit %>% forecast(h = 30)

chi_fit |> pivot_longer(everything())
# A mable: 6 x 2
# Key:     name [6]
  name                          value
  <chr>                       <model>
1 auto_arima <ARIMA(1,1,1)(0,0,2)[7]>
2 auto_dfix  <ARIMA(1,1,1)(0,1,2)[7]>
3 arima_ar   <ARIMA(2,0,0)(4,1,0)[7]>
4 arima_ma   <ARIMA(0,0,2)(0,1,1)[7]>
5 ets                    <ETS(A,N,A)>
6 snaive                     <SNAIVE>

Chicago Forecast - ARIMA output

.model RMSE
ets 2.446857
auto_dfix 2.448216
arima_ma 2.763591
auto_arima 2.857696
arima_ar 3.010882
snaive 4.151305

Next Steps

  • Explore advanced models
    • fable::prophet()
    • ML algorithms nnetar, xgboost
    • Time Series LLM
  • Get it out there! (with Quarto)

Q&A

  • Python equivalent to tidyverts?
  • Getting started with Quarto
    • on VS Code or Jupyter Notebooks